Improving the Performance of Calculating Document Similarity Using Metadata in Large-Scale Datasets

Jungeun Kim (김정은); Jae-Gil Lee (이재길)

데이터베이스 연구회지 (SIGDB) [Database Research Journal]

Korean Title	대용량 문서 데이터 셋에서 메타데이터를 활용한 문서 유사도 계산 성능 향상
English Title	Improving the Performance of Calculating Document Similarity Using Metadata in Large-Scale Datasets
Author	Jungeun Kim (김정은), Jae-Gil Lee (이재길)
Citation	Vol. 30, No. 1, pp. 89-97 (April 2014)
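Korean Abstract (English translation)	Measuring the similarity of every document pair in a large-scale document dataset incurs very high computational overhead. However, computational efficiency can be improved substantially by predicting which document pairs are likely to be similar and removing, before the calculation, the pairs that are very unlikely to be similar. This paper proposes a method that uses metadata to improve the performance of document similarity calculation on large-scale document datasets, focusing on academic paper datasets. Document metadata is data that describes a document and carries its attribute information; for academic papers, it includes the title, venue, and authors. We define the relatedness between academic papers using venue and author information, and we improve efficiency by excluding weakly related papers from the similarity calculation. Experiments on a large dataset of 420,000 academic papers confirmed that the proposed method performs 197 times better than a conventional method.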
English Abstract	Calculating the similarity of every pair of documents in a large-scale document collection incurs high computational overhead. However, efficiency can be improved substantially if we predict which document pairs are likely to be dissimilar and remove those pairs before the calculation. In this paper, we use document metadata to develop an efficient method of calculating document similarity for a very large number of documents, focusing on academic papers. Document metadata describes a document through its attributes; for academic papers, these include the title, venue, and authors. We define the relevancy between academic papers using venue and author information and exclude irrelevant document pairs to improve efficiency. We conducted extensive experiments on 420,000 academic papers. The results show that the proposed method outperformed a baseline method by a factor of 197.
Keyword	Large-Scale Document Dataset; Metadata; Document Similarity; Academic Paper
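
The pruning idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual relevancy definition, so this example assumes a simple stand-in rule: two papers are treated as relevant if they share a venue or at least one author. The record layout (`venue`, `authors`, `terms`), the function names, and the sample data are all hypothetical, and plain cosine similarity over term weights stands in for whatever similarity measure the authors used.

```python
from collections import defaultdict
from itertools import combinations
import math


def candidate_pairs(papers):
    """Return pairs of paper ids that share a venue or at least one author.

    Assumption: "relevant" simply means sharing a metadata value. The
    paper's own relevancy definition is not stated in the abstract.
    """
    index = defaultdict(set)  # metadata value -> ids of papers carrying it
    for pid, meta in papers.items():
        index[("venue", meta["venue"])].add(pid)
        for author in meta["authors"]:
            index[("author", author)].add(pid)

    pairs = set()
    for ids in index.values():
        # Only papers that fall into the same metadata bucket are paired.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pruned_similarities(papers):
    """Compute similarity only for metadata-relevant pairs."""
    return {
        (p, q): cosine(papers[p]["terms"], papers[q]["terms"])
        for p, q in candidate_pairs(papers)
    }


# Hypothetical sample data: four papers, so six possible pairs in total.
papers = {
    "p1": {"venue": "venue_x", "authors": ["A", "B"],
           "terms": {"index": 1.0, "query": 1.0}},
    "p2": {"venue": "venue_x", "authors": ["C"],
           "terms": {"query": 1.0, "join": 1.0}},
    "p3": {"venue": "venue_y", "authors": ["B"],
           "terms": {"index": 1.0}},
    "p4": {"venue": "venue_z", "authors": ["D"],
           "terms": {"kernel": 1.0}},
}

sims = pruned_similarities(papers)
```

In this toy data, only the pairs that share `venue_x` or author `B` reach the similarity step; the remaining pairs are never scored. Building the candidates from an inverted index over metadata values avoids enumerating all pairs up front, which is where the savings would come from on a large collection.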